69 research outputs found

    Performance Characterization of In-Memory Data Analytics on a Modern Cloud Server

    In the last decade, data analytics has rapidly progressed from traditional disk-based processing to modern in-memory processing. However, little effort has been devoted to enhancing performance at the micro-architecture level. This paper characterizes the performance of in-memory data analytics using the Apache Spark framework. We use a single-node NUMA machine and identify the bottlenecks hampering the scalability of workloads. We also quantify the inefficiencies at the micro-architecture level for various data analysis workloads. Through empirical evaluation, we show that Spark workloads do not scale linearly beyond twelve threads, due to work time inflation and thread-level load imbalance. Further, at the micro-architecture level, we observe memory-bound latency to be the major cause of work time inflation.
    Comment: Accepted to the 5th IEEE International Conference on Big Data and Cloud Computing (BDCloud 2015).
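
    The abstract's notion of work time inflation (aggregate thread busy time growing beyond the sequential work) can be observed with a generic microbenchmark. The sketch below is illustrative only, assuming a memory-bound reduction in C++/OpenMP rather than the paper's Spark workloads; it compares wall-clock time against total per-thread work time as the thread count grows.

        #include <cstdio>
        #include <omp.h>
        #include <vector>

        // Illustrative microbenchmark (not the paper's Spark workloads):
        // a memory-bound sum whose aggregate per-thread work time tends to
        // inflate as more threads contend for memory bandwidth on a NUMA node.
        int main() {
            const size_t n = size_t(1) << 26;        // ~64M doubles
            std::vector<double> a(n, 1.0);

            for (int t : {1, 2, 4, 8, 12, 16, 24}) {
                double work = 0.0;                   // sum of per-thread busy time
                double sum = 0.0;
                double t0 = omp_get_wtime();
                #pragma omp parallel num_threads(t) reduction(+ : sum, work)
                {
                    double w0 = omp_get_wtime();
                    #pragma omp for
                    for (size_t i = 0; i < n; ++i) sum += a[i];
                    work += omp_get_wtime() - w0;
                }
                double wall = omp_get_wtime() - t0;
                // Linear scaling keeps total work roughly constant; work time
                // inflation shows up as 'work' growing with the thread count.
                std::printf("threads=%2d wall=%.3fs work=%.3fs sum=%.0f\n",
                            t, wall, work, sum);
            }
            return 0;
        }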

    Code Generation and Run-time Support For Multi-Level Parallelism . . .

    In this paper we describe the main components of the NanosCompiler, an OpenMP compiler whose implementation is oriented towards the efficient exploitation of nested parallelism. Program parallelization relies both on the automatic parallelization capabilities of the base compiler and on information obtained from user-supplied directives. The compiler uses a hierarchical internal representation that unifies both sources of parallelism, proceeds with a task identification phase that adapts the granularity of the final tasks to the target architecture, and then generates parallel code. The paper also presents an analysis of the special support needed at the threads-library level for this kind of parallelism. These requirements are analyzed in our current implementation, NthLib.
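
    The NanosCompiler's own directive extensions are not reproduced in this listing; as a minimal sketch of the nested OpenMP parallelism it targets, using only standard OpenMP (the compiler additionally provides clauses for grouping threads between levels):

        #include <cstdio>
        #include <omp.h>

        // Two levels of nested parallelism in plain OpenMP: an outer team is
        // created first, and each outer thread then opens its own inner team.
        int main() {
            omp_set_max_active_levels(2);             // permit two active levels

            #pragma omp parallel num_threads(2)       // outer level of parallelism
            {
                int outer = omp_get_thread_num();
                #pragma omp parallel num_threads(4)   // inner level: nested team
                {
                    std::printf("outer thread %d, inner thread %d\n",
                                outer, omp_get_thread_num());
                }
            }
            return 0;
        }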

    Towards an efficient exploitation of loop-level parallelism in Java

    This paper analyzes the overheads incurred in the exploitation of loop-level parallelism using Java Threads and proposes code transformations that minimize them. Avoiding the intensive use of Java Threads and reducing the number of classes used to specify the parallelism in the application yields promising performance gains that may encourage the use of Java for exploiting loop-level parallelism. On average, the execution time of our synthetic benchmarks is reduced by 50% with the simplest transformation when 8 threads are used.
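
    The paper's transformations target Java Threads and are not shown in this listing. As a hypothetical sketch of the underlying idea in C++20 (the single language used for examples here): create worker threads once and reuse them across parallel loops, instead of constructing fresh thread objects for every loop.

        #include <atomic>
        #include <barrier>
        #include <functional>
        #include <thread>
        #include <vector>

        // Hypothetical illustration: workers are spawned once and reused for
        // every parallel loop, so per-loop cost is a barrier, not thread creation.
        int main() {
            const int T = 8;                       // number of worker threads
            std::vector<double> a(1'000'000, 1.0);
            std::function<void(int)> body;         // current loop body, set below
            std::atomic<bool> stop{false};
            std::barrier sync(T + 1);              // T workers + the coordinator

            std::vector<std::jthread> workers;
            for (int t = 0; t < T; ++t)
                workers.emplace_back([&, t] {
                    for (;;) {
                        sync.arrive_and_wait();    // wait until a loop is posted
                        if (stop) return;
                        body(t);                   // run this worker's share
                        sync.arrive_and_wait();    // signal loop completion
                    }
                });

            // Run several parallel loops on the same threads.
            for (int rep = 0; rep < 3; ++rep) {
                body = [&](int t) {
                    for (size_t i = t; i < a.size(); i += T) a[i] *= 2.0;
                };
                sync.arrive_and_wait();            // release the workers
                sync.arrive_and_wait();            // wait for them to finish
            }
            stop = true;
            sync.arrive_and_wait();                // wake workers so they can exit
        }                                          // jthreads join on destruction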

    Task-based Parallel Breadth-First Search in Heterogeneous Environments

    Breadth-first search (BFS) is an essential graph traversal strategy widely used in many computing applications. Because of its irregular data access patterns, BFS is a non-trivial problem that is hard to parallelize efficiently. In this paper, we introduce a parallelization strategy that allows the load balancing of computation resources as well as the execution of graph traversals in hybrid environments composed of CPUs and GPUs. To achieve that goal, we use a fine-grained task-based parallelization scheme and the OmpSs programming model. We obtain processing rates of up to 2.8 billion traversed edges per second with a single GPU and a multi-core processor. Our study shows that high processing rates are achievable in hybrid environments despite GPU communication latency and memory coherence overheads.
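
    The paper's OmpSs and GPU code is not included in this listing. Below is a minimal CPU-only sketch of the approach, assuming OpenMP tasks as a stand-in for OmpSs: each level of the frontier is split into chunks, and each chunk becomes a task, which is what enables load balancing across workers.

        #include <algorithm>
        #include <atomic>
        #include <vector>
        #include <omp.h>

        // Level-synchronous BFS with one task per frontier chunk (CPU-only;
        // the paper additionally dispatches chunks to GPUs).
        std::vector<int> bfs(const std::vector<std::vector<int>>& adj, int src) {
            const int n = (int)adj.size();
            std::vector<int> dist(n, -1);
            std::vector<std::atomic<char>> visited(n);
            for (auto& v : visited) v.store(0);

            std::vector<int> frontier{src};
            dist[src] = 0;
            visited[src] = 1;

            for (int level = 1; !frontier.empty(); ++level) {
                std::vector<int> next;
                #pragma omp parallel
                #pragma omp single
                {
                    const size_t chunk = 1024;     // task granularity: tunable
                    for (size_t lo = 0; lo < frontier.size(); lo += chunk) {
                        size_t hi = std::min(frontier.size(), lo + chunk);
                        #pragma omp task firstprivate(lo, hi)
                        {
                            std::vector<int> local;
                            for (size_t i = lo; i < hi; ++i)
                                for (int w : adj[frontier[i]]) {
                                    char seen = 0;   // claim w exactly once
                                    if (visited[w].compare_exchange_strong(seen, 1)) {
                                        dist[w] = level;
                                        local.push_back(w);
                                    }
                                }
                            #pragma omp critical
                            next.insert(next.end(), local.begin(), local.end());
                        }
                    }
                }   // implicit barrier: all tasks done before the next level
                frontier.swap(next);
            }
            return dist;
        }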

    Employing Nested OpenMP for the Parallelization of Multi-Zone Computational Fluid Dynamics Applications

    In this paper we describe the parallelization of the multi-zone versions of the NAS Parallel Benchmarks employing multi-level OpenMP parallelism. For our study we use the NanosCompiler, which supports nesting of OpenMP directives and provides clauses to control the grouping of threads, load balancing, and synchronization. We report the benchmark results, compare the timings with those of different hybrid parallelization paradigms, and discuss OpenMP implementation issues which affect the performance of multi-level parallel applications.
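
    The NPB-MZ sources and the NanosCompiler's grouping clauses are not part of this listing; below is a minimal sketch of the multi-zone pattern in standard OpenMP, assuming a placeholder smoothing step: an outer team spreads across zones while each outer thread opens an inner team over its zone's points.

        #include <omp.h>
        #include <vector>

        // Multi-zone nested parallelism: dynamic scheduling of the outer loop
        // balances zones of unequal size; the inner team parallelizes a zone.
        void relax_zones(std::vector<std::vector<double>>& zones,
                         int outer_threads, int inner_threads) {
            omp_set_max_active_levels(2);             // allow two nesting levels
            #pragma omp parallel for num_threads(outer_threads) schedule(dynamic)
            for (int z = 0; z < (int)zones.size(); ++z) {
                std::vector<double>& u = zones[z];
                std::vector<double> out(u);           // write out-of-place
                #pragma omp parallel for num_threads(inner_threads)
                for (int i = 1; i < (int)u.size() - 1; ++i)
                    out[i] = 0.5 * (u[i - 1] + u[i + 1]);  // placeholder stencil
                u.swap(out);
            }
        }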